Problem

The  present  study  was  conducted  to  explicate  the  sampling  problems 
involved  in  obtaining  data  for  decisions  about  the  acceptability  of  frames  in 
self-instructional  programs.   In  this  study,  data  were  examined  to  determine 
the  relative  efficiency  of  various  sample  sizes  in  making  decisions  about 
retaining  or  revising  the  frames  of  a  self-instructional  program.   Individual 
frames  are  not  assumed  to  be  statistically  independent  of  one  another.   The 
errors  made  on  one  frame  are  likely  to  be  correlated  with  those  made  on  another 
and  the  magnitude  of  the  correlations  is  generally  both  unknown  and  variable 
throughout  a  program.   This  state  of  affairs  indicates  the  need  for  an 
empirical  study  of  sampling  problems  in  decision  making.   Two  indices  were 
used  to  determine  the  implications  of  selection  criteria  and  sampling  procedures 
in  deciding  about  the  acceptability  of  frames.   One  index  was  the  percent  of 
undesirable  frames  correctly  identified.   The  other  was  the  percent  of 
rejections made erroneously (type I errors). The study also was conducted to
suggest guidelines for determining the sample size to use in developing
self-instructional programs in which it is expected that the overall error rates
will  be  low  and  the  distribution  of  error  rates  observed  for  frames  within 
the  program  will  be  skewed. 


Currently  Associate  Professor  of  Psychology  at  Sacramento  State  College, 
California. 
Decision Making and Types of Error

In  the  course  of  evaluating  and  revising  programed  instructional 
materials,  a  programer  may  decide  to  reject  as  undesirable  those  frames  in 
the program which he suspects will lead to a student error rate above a given
criterion value. In doing this, the programer is in a position similar to
that of a statistician faced with a large number of hypotheses to test. Each
frame  in  the  program  must  be  accepted  or  rejected.   In  effect,  for  each  frame, 
the  programer  must  test  the  null  hypothesis  that  the  frame  will  have  an 
error  rate  below  that  value  he  has  chosen  as  his  minimum  criterion  for 
"undesirableness"  when  the  program  is  put  into  general  use.   The  number  of 
observations on which each test of this hypothesis is based will be equal to
the  number  of  students  with  whom  the  program  is  pretested.   Usually  (though 
not  necessarily)  the  programer  rejects  the  null  hypothesis  whenever  a  frame 
is  found  to  have  an  error  rate  greater  than  the  one  chosen  as  the  criterion 
value.

In  making  each  decision  concerning  the  acceptability  of  a  program  frame, 
the  programer,  as  statistician,  may  make  one  of  two  types  of  errors.   He 
will  make  a  type  I  error  if  the  null  hypothesis  is  rejected  when  it  is  true; 
he  will  mislabel  an  acceptable  frame  as  "unacceptable."  He  will  make  a  type 
II  error  if  the  null  hypothesis  is  accepted  when  it  actually  is  false;  he  will 
mislabel an unacceptable frame as "acceptable." The programer definitely
wants to avoid type II errors. He wants what the statistician calls a
"powerful" test of the null hypothesis, one that has enough "power" to reject
each truly unacceptable frame. At the same time, however, the programer does
not  want  to  make  too  many  type  I  errors.   Otherwise,  he  will  be  needlessly 
rewriting  a  large  number  of  the  acceptable  frames  in  the  program.   Extensive 
rewriting  would  then  require  that  the  revised  program  be  pretested,  and  thus 
delay  and  increase  the  cost  of  the  finished  program. 
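
The two kinds of error can be made concrete with a small sketch. The Python
below labels each pretest decision against a reference classification. The
function name, the use of a reference error rate as a stand-in for the true
rate, and the four example frames are illustrative assumptions and are not
taken from the study.

```python
def classify_decisions(reference_rates, pretest_rates, criterion):
    """Count correct and erroneous accept/reject decisions.

    A frame is treated as truly unacceptable when its reference error rate
    is at or above the criterion, and it is rejected when its pretest
    error rate is at or above the same criterion.
    """
    counts = {"correct_reject": 0, "type_I": 0, "type_II": 0, "correct_accept": 0}
    for ref, obs in zip(reference_rates, pretest_rates):
        unacceptable = ref >= criterion
        rejected = obs >= criterion
        if rejected and unacceptable:
            counts["correct_reject"] += 1
        elif rejected:
            counts["type_I"] += 1      # acceptable frame rejected
        elif unacceptable:
            counts["type_II"] += 1     # unacceptable frame retained
        else:
            counts["correct_accept"] += 1
    return counts

# Hypothetical error rates for four frames (not data from the study).
reference = [0.05, 0.25, 0.30, 0.02]
pretest = [0.20, 0.40, 0.00, 0.00]
result = classify_decisions(reference, pretest, criterion=0.20)
print(result)
```

In this made-up case each of the four outcomes occurs once: the first frame
is a type I error and the third is a type II error.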

Power  and  Sample  Size 

Given  a  fixed  number  of  observations  (N),  the  only  way  to  change  the 
probability  of  a  type  I  error  is  to  shift  the  rejection  criterion.   Unfortunately, 
in  this  situation  a  shift  of  the  rejection  criterion  which  reduces  chances 
of  a  type  I  error  simultaneously  reduces  the  power  of  the  test.   Conversely,  a 
shift  of  the  criterion  which  increases  the  power  of  the  test  results  in  the 
increased  probability  of  a  type  I  error.   Only  by  increasing  N  can  the  power 
of  the  test  be  increased  without  a  simultaneous  increase  in  the  probability  of 
a  type  I  error.   Similarly,  only  by  increasing  N  can  the  probability  of  a  type 
I  error  be  reduced  without  diminishing  the  power  of  the  test. 
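
The trade-off described above can be illustrated numerically. The sketch
below assumes, for simplicity, that each of the N students misses a frame
independently with a fixed probability, so that the number of errors is
binomial. The two true error rates (.05 for an acceptable frame, .30 for an
unacceptable one) and the rejection rules are hypothetical choices made only
for illustration.

```python
from math import comb

def rejection_probability(n, min_errors, true_rate):
    """P(at least min_errors of n students miss the frame), binomial model."""
    return sum(
        comb(n, k) * true_rate**k * (1 - true_rate) ** (n - k)
        for k in range(min_errors, n + 1)
    )

good_frame, bad_frame = 0.05, 0.30   # hypothetical true error rates

# (N, number of errors required for rejection)
for n, min_errors in [(10, 1), (10, 2), (40, 5)]:
    alpha = rejection_probability(n, min_errors, good_frame)  # type I probability
    power = rejection_probability(n, min_errors, bad_frame)   # power of the test
    print(f"N={n:2d}, reject at {min_errors}+ errors: "
          f"type I={alpha:.3f}, power={power:.3f}")
```

With N held at 10, moving the rejection point from one error to two lowers
both the type I probability and the power. Only the larger sample in the last
row improves on both at once.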

Since each test is based on the error data of students utilized in
pretesting the program, the power of each test will be inversely related to the
performance level (and thus directly related to the overall error rate) of
the N students comprising the
sample.   If  the  N  students  perform  quite  well,  observed  error  rates  will  be 
depressed  for  all  items.  Consequently,  fewer  unacceptable  items  will  be 
rejected.  Conversely,  if  the  N  students  perform  quite  poorly,  observed  error 
rates  will  be  higher  and  more  frames  (both  acceptable  and  not)  will  be 
rejected.  This  suggests  the  desirability  of  obtaining  measures  characterizing 
the  sample  of  students  used  in  terms  of  relevant  abilities,  and  of  obtaining 
measures  of  their  representativeness  in  terms  of  relevant  academic  achievement. 
True Error Rate

The  power  of  each  test  made  by  the  programer  also  depends  upon  the 
degree  to  which  the  actual  state  of  affairs  approaches  that  stated  in  the  null 
hypothesis.  We  can  speak  of  the  true  error  rate  of  a  frame  as  that  error 
rate  which  would  be  found  for  it  if  it  were  given  to  all  of  the  intended 
population of students as a part of the finished program. If the "unacceptable"
frame  has  a  very  high  "true"  error  rate,  the  test  based  on  pretesting  data 
will  be  more  likely  to  reject  it  than  an  unacceptable  frame  with  a  "true" 
error  rate  which  approaches  acceptability.   "True"  error  rates  are  generally 
unknown  for  program  frames.   This  lack  of  knowledge  is  what  leads  the  programer 
to  construct  a  test  for  selecting  frames  in  the  first  place.  All  the  programer 
can do, therefore, is accept the fact that his test will be effective in
eliminating the extremely unacceptable frames only at the cost of
retaining some "borderline" unacceptable frames. However, the percentage of
truly  unacceptable  frames  rejected  by  tests  based  on  data  from  a  given 
pretesting  sample  of  size  N  will  not  be  expected  to  be  very  high  in  the  case 
of  programs  which  have  already  gone  through  successful  revision. 
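
The dependence of power on the true error rate can be written out under a
simplifying assumption that is not made in the study itself: that the N
students respond to a frame independently, each missing it with probability
p equal to its true error rate. The symbols p, c (the criterion error rate),
and k are introduced here only for illustration.

```latex
P(\text{reject}) \;=\; \sum_{k=\lceil cN \rceil}^{N} \binom{N}{k}\, p^{k}\,(1-p)^{N-k}
```

For N=5 and c=.20 a single error suffices for rejection, so a frame with
p=.50 is rejected with probability 1 - .50^5, or about .97, while a
borderline frame with p=.25 is rejected with probability 1 - .75^5, or
about .76.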

Method 

Materials 

The program. UICSM's programed instructional series, Part 110, was
selected for study since it teaches relatively difficult concepts and thus
some variability in error rates for frames could be expected. Worksheet error
data for 178 of the students who completed Part 110 (pure mode)² was recorded
on SCRIBE³ sheets⁴ and used with the SCRIBE system to produce IBM cards
containing individual error data for each student. Error data from the completed
worksheets of the 52 additional pure-mode subjects was punched directly into
IBM cards in the SCRIBE output format. Thus, data from 230 students was
available for sampling.

Sampling procedure. All sampling of data from the basic pool of 230
students was done without replacement. A random sample of 100 students from
all seven classes in four participating schools was selected to establish
criterion error rate measures on the 308 program frames in UICSM PIP Part 110.
This sample of 100 was stratified with regard to classes and schools. A
similar stratified random sampling procedure was then followed as closely as
possible in selecting several other samples of various sizes. In cases where
a stratified procedure was clearly impossible, such as in a sample of size 1
(N=1), ordinary random sampling was utilized. Three samples each of the
following sizes were selected: N=1, N=2, N=3, N=4, N=5, N=10, and N=15. These
21 samples were also merged to form a "summation sample" with N=120.
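
The bookkeeping of this design can be sketched as follows. The sketch uses
simple random sampling only and ignores the stratification by class and
school, so it is a simplification of the procedure described above. The
function name, the seed, and the use of integers as student identifiers are
assumptions made for illustration.

```python
import random

def draw_samples(pool, criterion_size=100, sizes=(1, 2, 3, 4, 5, 10, 15),
                 per_size=3, seed=0):
    """Draw a criterion sample and several small samples without replacement."""
    rng = random.Random(seed)
    shuffled = list(pool)
    rng.shuffle(shuffled)
    criterion = shuffled[:criterion_size]
    position = criterion_size
    small_samples = []
    for size in sizes:
        for _ in range(per_size):
            # Consecutive slices of one shuffled list never overlap.
            small_samples.append(shuffled[position:position + size])
            position += size
    # The summation sample pools every student used in the small samples.
    summation = [student for sample in small_samples for student in sample]
    return criterion, small_samples, summation

students = range(230)  # stand-in identifiers for the 230 pretested students
criterion, small_samples, summation = draw_samples(students)
print(len(criterion), len(small_samples), len(summation))  # prints: 100 21 120
```

The 100 criterion students and the 120 summation students together use 220
of the 230 available, which is why sampling without replacement is feasible.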


2
See Beberman, M. and Stolurow, L. M. Comparative studies of principles
for programing mathematics for programed instruction. Semi-annual report for
description of the modes, schools and classes involved in the 1962-63 tryout
for UICSM programs. Urbana, Ill.: Univer. of Ill., 1963.

3
SCRIBE is a system developed and used by ETS to score multiple-choice
answer sheets and to automatically transcribe the data on the answer sheets
to IBM cards.

4
The SCRIBE technique of recording and processing worksheet data is
described in Frincke, G. L. and Stolurow, L. M. Three methods of recording
worksheet performance. Urbana, Ill.: Univer. of Ill., 1964. USOE Title
VII, Technical Report No. 7. This work was done in cooperation with
Educational Testing Service and arrangements for it were made by Dr. Paul Jacobs.
The ETS contribution was made possible by a grant from the Carnegie
Corporation of New York.
Item analyses. Worksheet error data cards prepared from the worksheets
of the subjects who constituted the criterion sample, the summation sample,
and the 21 smaller samples, were used as the basis for 23 separate item
analyses of the 308 items which comprise the UICSM Programed Instruction Part
110.

Three  hundred  and  eight  summary  IBM  cards  were  then  prepared.   Each  of 
these  cards  contained  the  results  of  all  23  analyses  with  regard  to  one  of 
the  308  program  items.   The  summary  cards  were  used  to  determine  correlations 
between  the  results  of  item  analyses  based  on  the  various  samples.   These 
cards  were  also  used  to  determine  which  frames  would  be  rejected  and  which 
accepted  by  tests  based  on  the  data  of  the  various  samples  in  cases  where  the 
criterion for rejection would be an observed error rate equal to or greater
than 10%, 15%, or 20%.
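
The item-analysis steps just described can be sketched in a few lines. The
original analyses were run from punched cards on an IBM 1620; the Python
below is only a modern illustration of the same arithmetic. The function
names and the tiny 5-student, 4-frame data set are hypothetical.

```python
def frame_error_rates(responses):
    """responses[s][f] is 1 if student s missed frame f, else 0."""
    n_students = len(responses)
    n_frames = len(responses[0])
    return [sum(student[f] for student in responses) / n_students
            for f in range(n_frames)]

def rejected_frames(error_rates, criterion):
    """Indices of frames whose observed error rate is at or above the criterion."""
    return [f for f, rate in enumerate(error_rates) if rate >= criterion]

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical worksheets: 5 students by 4 frames (1 = error).
sample = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
rates = frame_error_rates(sample)
for criterion in (0.10, 0.15, 0.20):
    print(criterion, rejected_frames(rates, criterion))

# Correlation of the sample rates with hypothetical criterion-sample rates.
criterion_rates = [0.05, 0.50, 0.02, 0.10]
print(round(pearson_r(rates, criterion_rates), 3))
```

Applied to the study's data, `frame_error_rates` would be run once per
sample and each result compared with the rates from the criterion sample.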

Results 

Figure  1  is  a  frequency  distribution  of  the  different  error  rates 
observed in Part 110 for the criterion sample (N=100). Table 1 shows the
distribution  of  error  rates  observed  in  the  21  smaller  samples.  All  of  the 
distributions  are  quite  skewed.   Most  items  are  well  within  the  limits  of 
acceptability  and  the  number  of  extremely  unacceptable  items  in  Part  110  is 
actually  quite  low.   This  is  an  important  factor  in  interpreting  the  findings 
of  this  study. 


5
These analyses were carried out with the aid of an IBM 1620 computer
and a program written by Mr. Scott Krueger, University of Illinois, Training
Research Laboratory.
Table 2 presents correlations between the error rates of the 308 items
as  determined  by  each  of  the  samples  and  error  rates  based  on  the  criterion 
sample.   These  correlations  are  generally  quite  low  as  might  be  expected  due 
to  the  small  range  of  error  rates  that  obtained  and  the  extreme  skewness  of 
the  error  rate  distributions. 

Estimates  of  the  overall  program  error  rate  based  on  the  21  independent 
samples  also  are  presented  in  Table  1.   Considerable  variability  among  the 
estimates  based  on  the  smaller  samples  can  be  seen.   For  example,  those  for 
samples of size 1 range from 2.5% to 7.1%, those of size 5 range from 4.8%
to  11.7%  overall  error  rate.   Figure  2  shows  the  mean  overall  error  rate 
estimates  for  each  of  the  sample  sizes  in  relation  to  the  overall  error  rate 
observed  in  the  criterion  sample.  All  of  these  overall  error  rates  appear  to 
fluctuate  randomly  about  the  criterion  value  with  the  exception  of  that  based 
on the samples of N=5. Here a very excessive and presumably spurious error
rate  was  observed. 
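
The size of the N=5 discrepancy can be checked directly from the reported
estimates. The sketch below uses the three overall error rate estimates
reported for the samples of size 5 and of size 15, together with the
criterion value of 5.6%. The lowest N=5 value is taken as 4.8%, following
the range quoted in the text above. The variable names are illustrative.

```python
from statistics import mean

criterion_rate = 5.6  # overall error rate (%) in the criterion sample

# Overall error rate estimates (%) for the three samples of each size.
estimates = {5: [11.7, 7.2, 4.8], 15: [3.7, 4.5, 5.1]}

for size, values in estimates.items():
    spread = max(values) - min(values)
    print(f"N={size}: mean={mean(values):.1f}%, range={spread:.1f}, "
          f"mean minus criterion={mean(values) - criterion_rate:+.1f}")
```

The N=5 samples average roughly two percentage points above the criterion
value, while the N=15 samples average about one point below it.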

Table 2

Correlations of Sample Item Analyses With Criterion
Item Analyses as a Function of Sample Size

Sample           Estimate of program    Correlation
No.      Size    error rate in %        with criterion

 0a      100           5.6                 1.000
 1         1           7.1                  .329
 2         1           2.5                  .280
 3         1           3.6                  .101
 4         2           0.0                  .356
 5         2           6.9                  .297
 6         2           3.6                  .318
 7         3           5.0                  .272
 8         3           4.5                  .352
 9         3           5.6                  .302
10         4           2.0                  .455
11         4           8.7                  .481
12         4           6.4                  .462
13         5          11.7                  .453
14         5           7.2                  .470
15         5           4.8                  .535
16        10           3.0                  .567
17        10           8.0                  .514
18        10           5.5                  .552
19        15           3.7                  .600
20        15           4.5                  .711
21        15           5.1                  .6:1
22b      120           5.5                  .872

a Criterion sample.
b Summation sample.

Analysis is based upon 308 frames of UICSM PIP Part 110.
All correlations except .101 are significant beyond
the .01 level.

Figure 3 depicts the efficiency of the seven sample sizes by showing the
relationship between the mean percentage of unacceptable frames that were
rejected and sample size. It shows that all sample sizes greater than N=4,
with the exception of the sample size N=10 at the 15% criterion level, rejected
50% or more of the unacceptable items. In all cases, a rapid rise in the mean
percentage of the unacceptable items correctly identified for rejection can
be seen in relation to the sample size as it increases from 1 to 5. Each line
represents a different criterion. In the case of a 10% error rate criterion,
the efficiency of the sample size increases up to size 10. With a 15% or 20%
error rate, the efficiency of the sample in terms of the rejection of unacceptable
items is seen to first increase to sample size 5 even more rapidly than the
curve for the 10% criterion. However, these curves decrease as sample size is
increased beyond N=5 and then increase as sample size goes from 15 to 120.
Table 3 presents both the percentage and number of unacceptable items correctly
identified by each of the samples of subjects. It should be noted that for
a 10% criterion, for example, all percentages are above 50 only for samples
of N=5 or more. However, with a 15% criterion not all the samples of size 15
resulted in 50% or greater correct identifications of unacceptable frames.

Table  4  shows  the  number  and  percentage  of  the  acceptable  items  that 
would  be  rejected  by  each  of  the  samples  using  the  various  criteria.   This 
data  has  been  combined  with  that  of  Table  3  to  produce  Figure  4.   Figure  4 
depicts  erroneous  rejections  in  terms  of  their  percentage  of  all  rejections 
made  and  thus  serves  to  indicate  the  "cost"  of  each  correct  rejection  in 
terms  of  incorrectly  rejected  items. 
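
As a worked illustration of this index, the percentage of rejections made
erroneously can be computed from the counts in Tables 3 and 4. The two
cases below use the 20% criterion. The layout of the formula is supplied
here for illustration, and the counts are those reported in the tables.

```latex
\text{percent erroneous} \;=\;
  \frac{\text{erroneous rejections}}
       {\text{erroneous rejections} + \text{correct rejections}} \times 100

% Sample 13 (N=5): 120 acceptable frames rejected (Table 4),
% 11 unacceptable frames rejected (Table 3)
\frac{120}{120 + 11} \times 100 \approx 91.6

% Sample 19 (N=15): 7 acceptable frames rejected (Table 4),
% 5 unacceptable frames rejected (Table 3)
\frac{7}{7 + 5} \times 100 \approx 58.3
```

Sample 13 identified all 11 unacceptable frames, but each correct rejection
was accompanied by roughly eleven erroneous ones.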

Table 3

The Percentage and Number of Unacceptable Frames that Would be Rejected
as a Function of the Size of Sample and the
Criterion Error Rate Used for Rejection

    Sample            Criterion error rate used for rejection
                        20%            15%            10%
 Size    No.          %    No.       %    No.       %    No.

  100     a         100    11b     100    24c     100    63d
    1     1          45     5       29     7       19    12
    1     2          18     2       17     4       10     6
    1     3          18     2        8     2        5     3
    2     4          45     5       46    11       38    24
    2     5          55     6       38     9       30    19
    2     6          36     4       21     5       22    14
    3     7          36     4       29     7       29    18
    3     8          64     7       42    10       32    20
    3     9          36     4       54    13       35    22
    4    10          45     5       50    12       25    16
    4    11          64     7       79    19       62    39
    4    12          73     8       67    16       63    40
    5    13         100    11       92    22       68    43
    5    14          91    10       58    14       57    36
    5    15         100    11       88    21       52    33
   10    16          64     7       42    10       56    35
   10    17          45     5       38     9       78    49
   10    18          64     7       62    15       70    44
   15    19          45     5       29     7       51    32
   15    20          82     9       58    14       51    32
   15    21          45     5       92    22       52    33
  120    22e         73     8       62    15       68    43

a Criterion sample.
b 11 frames were unacceptable at this criterion level.
c 24 frames were unacceptable at this criterion level.
d 63 frames were unacceptable at this criterion level.
e Summation sample.


Table 4

The Percentage and Number of Acceptable Frames that Would be Erroneously
Rejected as a Function of the Size of Sample and the
Criterion Error Rate Used for Rejection

    Sample            Criterion error rate used for rejection
                        20%b           15%c           10%d
 Size    No.          %    No.       %    No.       %    No.

  100     a
    1     1         5.7    17      5.3    15      4.1    10
    1     2         2.0     6      1.4     4      0.8     2
    1     3         3.4    10      3.5    10      3.7     9
    2     4        14.1    42     12.7    36      9.4    23
    2     5        12.1    36     11.6    33      9.4    23
    2     6         5.7    17      5.3    15      2.9     7
    3     7        12.5    37     12.0    34      9.4    23
    3     8        10.1    30      9.5    27      6.9    17
    3     9        15.8    47     13.4    38     11.8    29
    4    10         5.1    15      2.8     8      1.6     4
    4    11        28.3    84     25.7    73     21.6    53
    4    12        27.6    82     26.1    74     20.4    50
    5    13        40.4   120     38.4   109     35.9    88
    5    14        26.3    78     26.1    74     21.2    52
    5    15        18.9    56     16.2    46     13.9    34
   10    16         6.7    20      6.0    17     19.6    48
   10    17        17.2    51     16.5    47     43.7   107
   10    18        10.8    32      8.5    24     30.2    74
   15    19         2.4     7      1.8     5      2.4     6
   15    20         3.7    11      2.1     6      6.1    15
   15    21         2.7     8      1.8     5     11.0    27
  120    22e        1.0            3.2            4.1

a Criterion sample.
b 297 frames were acceptable at this criterion level.
c 284 frames were acceptable at this criterion level.
d 245 frames were acceptable at this criterion level.
e Summation sample.

Discussion

Skewness Effects

The extreme positive skewness of the error rate distribution appears to
affect the efficiency of a given sample size in leading to the rejection of
unacceptable frames. It seems to have lowered the efficiency of N when the
rejection criterion was reduced from a 20% error rate to a 10% error rate.
A shift in the rejection criterion of this sort moves the point of rejection
to a place in the frequency distribution where considerably more frames are
accumulated, since the frames, in general, tend to produce small numbers of
errors. Since the power of the test is lowest for frames which have true error
rates that are almost acceptable, the overall efficiency of the tests of frame
acceptability is lowered in this situation. One factor, however, works to
counteract this reduction in efficiency to some extent. It is clear from the
formula for the standard error of a proportion ($\sigma_p = \sqrt{p(1-p)/N}$)
that as the proportion (p) is shifted away from .5, the standard error is
reduced. Thus, the standard error of the error rates (proportion of errors)
which are observed for borderline unacceptable items will be reduced when the
point of rejection is shifted further away from the 50% error rate value. As
can be seen from the formula, this effect rapidly becomes less important as N
is increased.
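
The behavior of the standard error of a proportion can be checked with a
short calculation. The sketch below evaluates the usual formula,
sqrt(p(1-p)/N), at the 20% and 10% criterion points for three sample sizes.
The function name and the particular values of N are chosen here for
illustration.

```python
from math import sqrt

def standard_error(p, n):
    """Standard error of an observed proportion p based on n observations."""
    return sqrt(p * (1 - p) / n)

for n in (5, 15, 120):
    at_20 = standard_error(0.20, n)
    at_10 = standard_error(0.10, n)
    print(f"N={n:3d}: SE at .20 = {at_20:.3f}, SE at .10 = {at_10:.3f}, "
          f"difference = {at_20 - at_10:.3f}")
```

The standard error is smaller at .10 than at .20 for every N, and the
difference between the two shrinks as N grows, which is the sense in which
the effect becomes less important in larger samples.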

The mean overall error rates observed for the various sample sizes are
depicted  in  Figure  2.   These  must  be  considered  when  interpreting  the  efficiency 
curves  in  Figure  3.   Of  major  concern  here  is  the  fact  that  the  mean  overall 
error  rate  of  the  samples  with  N=5  was  considerably  above  that  of  the  criterion. 
Thus,  the  efficiency  curves  in  Figure  3  are  higher  for  N=5  than  would  normally 
be  expected.   The  same  curves  are  somewhat  depressed  at  the  point  where  N=15 
due  to  the  fact  that  the  mean  overall  error  rate  happened  to  be  lowest  for 
samples  of  this  size. 

Two major relationships are illustrated in Figure 3. The first is that
in  detecting  unacceptable  frames,  samples  with  N=5  or  N=10  can  approach  quite 
closely  the  efficiency  of  much  larger  samples.   Inspection  of  Table  3,  however, 
shows  that  while  the  mean  efficiency  of  the  smaller  samples  is  high, 
variability  in  efficiency  is  also  quite  high.   Thus,  the  programer  may  be 
able  to  approach  the  efficiency  of  a  very  large  sample  with  only  a  small  one, 
but he runs a definite risk of drawing a very poor sample. It should be
noted  that  even  the  summation  sample  leaves  a  great  deal  to  be  desired  in  the 
identification  of  faulty  frames. 

While  it  is  possible  to  identify  three-fourths  or  more  of  the  unacceptable 
frames  with  samples  as  small  as  10  when  a  10%  criterion  is  used,  sampling 
fluctuations  are  great  enough  that,  in  this  study,  no  sample  of  size  15 
was  this  efficient.   The  data  indicate  that  when  a  higher  criterion  is  used, 
smaller  samples  may  identify  three-fourths,  or  more,  of  the  unacceptable 
frames,  and  that  this  could  occur  with  samples  as  small  as  4.  While  it  may 
not  occur  with  even  larger  samples,  the  chances  are  greater  that  it  will. 

Sample  Size  and  Rejection  Criteria 

A  second  relationship,  shown  in  Figure  3,  involves  sample  size  and  the 
rejection  criterion.   The  curve  for  the  10%  criterion  reaches  its  first 
maximum  when  N=10,  while  the  curve  for  the  20%  criterion  reaches  its  first 
maximum  at  N=5.   This  is  due  to  the  fact  that  with  these  Ns  and  these 
criteria,  N  is  as  large  as  possible  at  the  points  where  one  subject  missing 
a  frame  will  cause  it  to  be  rejected.   In  other  words,  the  lowest  possible 
error  rate  other  than  0%  which  can  be  observed  in  these  samples  is  equal  to 
the  rejection  criterion  at  these  points.   This  is  not  true  for  samples  with 
slightly  smaller  or  slightly  larger  Ns.   For  example,  when  the  criterion 
for  rejection  was  a  20%  error  rate  and  N=3,  the  only  possible  observed  error 
rates were 0%, 33%, 67% and 100%. Thus, if a frame was to be rejected, 33%
or more of the students had to miss it. This means, in effect, that actions
taken in making a decision to accept or reject a frame were the same as if the
rejection criterion was shifted to an observed error rate of 33%, while the
null  hypothesis  being  tested  continued  to  be  that  the  frame  has  an  error  rate 
less  than  20%.   The  effect  of  this  de  facto  rejection  criterion  shift  is  a 
reduction  in  the  power  of  the  test  of  this  hypothesis  (the  probability  of  a 
type  I  error  was  reduced,  however).  When  N=5  in  this  case,  possible  observed 
error  rates  were  0%,  20%,  40%,  60%,  80%  and  100%.   Thus,  no  de  facto  shift 
in  the  rejection  criterion  occurred,  since  an  observed  error  rate  of  20% 
led  to  rejection  of  the  item.   Increasing  N  from  N=5  to  N=6  would,  according 
to  the  same  principle,  reduce  the  power  of  the  test.  With  N=6,  possible 
observed  error  rates  are  0%,  17%,  33%,  50%,  67%,  83%  and  100%.  Again,  a 
de  facto  shift  of  the  rejection  criterion  to  33%  would  occur  and  the  power  of 
the  test  would  be  reduced. 
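
The de facto criterion for any N can be computed directly. The sketch below
finds the smallest number of errors that meets the stated criterion and
converts it back to an error rate. Exact fractions are used so that products
such as .20 x 5 are not disturbed by rounding. The function name is an
illustrative choice.

```python
from fractions import Fraction
from math import ceil

def de_facto_criterion(criterion_percent, n):
    """Smallest observable error rate at or above the criterion for n students."""
    min_errors = ceil(Fraction(criterion_percent, 100) * n)
    return min_errors, Fraction(min_errors, n)

for n in (3, 4, 5, 6, 10):
    min_errors, rate = de_facto_criterion(20, n)
    print(f"N={n:2d}: reject at {min_errors} error(s), "
          f"de facto criterion = {float(rate):.0%}")
```

For the 20% criterion this gives 33% at N=3, 20% at N=5, and 33% again at
N=6, in agreement with the cases worked through above.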

It  would  be  expected  that  if  more  sample  sizes  had  been  included  in  the 
present  study,  then  each  of  the  curves  in  Figure  3  would  regularly  rise  and 
fall  as  N  increased  from  0  to  120.  Each  successive  maximum  would  be  a  little 
higher  than  the  last  due  to  increased  power  brought  about  by  increasing  N. 
Each  successive  minimum  would  also  be  higher  due  to  closer  approximation  of 
the  de  facto  rejection  criterion  to  the  desired  rejection  criterion.   The 
maxima  would  always  be  at  points  where  the  product  of  the  criterion  percent 
and  N  is  a  whole  number.   For  some  criteria  the  maxima  will  occur  with  N 
equal  to  an  integer.   For  others,  N  will  not  always  be  an  integer  but  will 
sometimes  be  an  integer  plus  a  fraction.   In  each  of  these  cases,  the  observed 
maximum would be the integer with the fraction dropped. This would be the
case for a criterion of 15%. The first maximum for this criterion would
theoretically be with N=6.7. The observable maximum would be at N=6, however,
since a sample with N=6.7 cannot be obtained.
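
The expected rise and fall of the curves can be illustrated under a
simplifying assumption that the study does not make: that students miss a
frame independently with a fixed probability. The true error rate of .30
and the function name below are hypothetical choices for illustration.

```python
from math import comb

def power(n, criterion_percent, true_rate):
    """P(observed error rate >= criterion) for n independent students."""
    min_errors = -(-criterion_percent * n // 100)  # integer ceiling of criterion * n
    return sum(
        comb(n, k) * true_rate**k * (1 - true_rate) ** (n - k)
        for k in range(min_errors, n + 1)
    )

true_rate = 0.30  # hypothetical true error rate of an unacceptable frame
for n in range(3, 12):
    print(f"N={n:2d}: power at 20% criterion = {power(n, 20, true_rate):.3f}")

# Largest N at which a single error still meets the criterion.
first_maximum = {c: 100 // c for c in (10, 15, 20)}
print(first_maximum)  # prints: {10: 10, 15: 6, 20: 5}
```

Under these assumptions power climbs to N=5, drops at N=6, and reaches a
higher peak at N=10. The first maxima fall at N=10, 6, and 5 for the 10%,
15%, and 20% criteria, as stated above.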

Efficiency  in  Terms  of  Type  I  Errors 

Thus  far,  the  discussion  has  concerned  the  efficiency  of  various 
sample  sizes  in  terms  of  the  percentage  of  unacceptable  frames  rejected. 
Efficiency  should  also  be  evaluated  in  terms  of  the  extent  of  type  I  errors. 
Examination  of  Table  4  reveals  that  large  numbers  of  acceptable  frames  were 
rejected  by  tests  based  on  the  smaller  samples.  When  these  numbers  are  presented 
as  percents  of  all  rejections  made,  as  in  Figure  4,  the  lower  efficiency  of 
smaller  sample  sizes  becomes  quite  clear.   For  samples  with  N  less  than  15, 
the  best  samples  led  to  the  rejection  of  at  least  one  acceptable  frame 
along with every unacceptable frame rejected. The worst rejected as many as
nine  acceptable  frames  for  every  unacceptable  frame  rejected.  The  curves  in 
Figure  4  as  in  Figure  3  have  probably  been  influenced  by  the  mean  overall 
error  rates  observed  with  the  various  sample  sizes.   Thus,  they  are  somewhat 
higher  where  N=5  and  somewhat  lower  at  N=15  than  would  normally  be  expected. 
Figure 4 shows clearly, however, that one definite advantage of obtaining
a large pretesting sample is that erroneous rejections are considerably
reduced.

It  is  apparent  that  the  percentage  of  the  rejections  made  erroneously  is 
also  a  function  of  the  rejection  criterion.  For  smaller  samples,  at  least, 
the percentage of rejections made erroneously is lowered when the rejection
criterion  is  reduced  from  a  20%  to  a  10%  error  rate.   Such  a  shift  in  the 
rejection  criterion  results  in  changing  at  least  three  things  which  would 
affect  the  percentage  of  rejections  made  erroneously.   First,  by  lowering  the 
rejection  criterion,  the  proportion  of  acceptable  frames  in  the  program  is 
reduced  and  the  proportion  of  unacceptable  frames  increased.   This  reduces 
the  probability  that  an  error  made  by  a  student  will  be  made  on  an  acceptable 
frame.   Thus,  fewer  acceptable  frames  are  rejected.   Second,  the  lowering  of 
the criterion places more frames in the "almost unacceptable" category due to
the  skewness  of  the  error  rate  distribution.   This  would  tend  to  increase 
erroneous  rejections  and  at  the  same  time  reduce  the  power  of  the  test  in 
making  proper  rejections,  since  more  "almost  unacceptable"  frames  would  be 
close  to  the  criterion  point.  The  effect  of  these  changes  would  be  to 
increase  the  percent  of  erroneous  rejections.  A  third  result  of  a  downward 
criterion  shift  would  be  to  reduce  the  mean  standard  error  of  the  acceptable 
frames  for  reasons  already  mentioned.   Such  a  reduction  in  variability  lowers 
the  percentage  of  erroneous  rejections.  With  a  large  N,  however,  this  effect 
is  considerably  smaller.  The  net  effect  of  all  these  changes  depends 
considerably on the distribution of error rates in the program.

Hazards  of  Small  Samples 

The wide variation in efficiency among samples of a given size, in
terms of rejecting unacceptable frames and failing to reject acceptable ones,
also obtains where the prediction of overall program error rate is concerned.
This is readily seen when inspecting the overall error rate estimates
presented in Table 2.

The  failure  to  obtain  consistent  results  with  smaller  samples  in  the 
present study points up a major objection to the use of small pretesting
samples.   One  cannot  be  confident  that  a  small  sample  of  students  will  produce 
individual  and  overall  error  rates  consistent  with  those  which  would  obtain  in 
the population for which the program was intended. This objection, along
with  the  fact  that  erroneous  rejections  are  quite  frequent  when  small  samples 
are  employed,  must  be  seriously  considered  by  the  planner  of  a  pretesting 
program.  The  cost  of  failing  to  reject  unacceptable  items,  of  rejecting 
acceptable  items,  and  of  inaccurately  estimating  the  overall  error  rate  for  a 
program  must  be  balanced  against  the  cost  of  pretesting  the  program.  When 
these things are considered the N of the pretesting sample should be set as
large  as  is  practical.   It  should  be  chosen  so  that  the  product  of  the 
desired  rejection  criterion  and  N  is  an  integer.   This  will  maximize  the  power 
of  the  test. 

Summary 
In spite of the common practice, in developing programed learning
materials, of using small samples of students from the target population to
accept or reject frames, there has been no examination of the implications
of this practice. This study relates the problem to the problem of the
statistician who is testing a large number of hypotheses. The concepts of
rejection level, type I and II errors, and the statistical concept of the power
of a test are applied. The empirical nature of the study is important since
it is characteristic of the errors made to be intercorrelated and to form a
skewed distribution with a mean that departs substantially from .5.
Twenty-one independent samples of seven different sizes and three per size
were drawn from student worksheets used in learning from an algebra program
based upon the UICSM curriculum. The hazards of small samples (up to N=15)
with rejection criterion levels of 10%, 15% and 20% were examined. Wide
variations in efficiency among samples of a given size were observed both in
terms of (a) rejection of acceptable frames, and (b) failing to reject
unacceptable ones. Coupled with the inconsistency of small pretesting samples
is the high frequency of erroneous rejections. It was recommended that pretest
samples be both as large as practical and chosen so that the product of the
desired rejection criterion and N is an integer, so as to maximize the power
of the test.